Application of machine learning techniques to the modeling of solubility of sugar alcohols in ionic liquids

The current trend in the chemical industries demands green processing, in particular the use of natural substances such as sugar-derived compounds. This has encouraged the academic and industrial sectors to seek new alternatives for extracting these materials. Ionic liquids (ILs) are currently paving the way for efficient extraction processes. To this end, accurate estimation of solubility data is of great importance. This study relies on machine learning methods for modeling the solubility data of sugar alcohols (SAs) in ILs. An initial relevancy analysis confirmed that the SA-IL equilibrium is governed by the temperature, the density and molecular weight of the ILs, and the molecular weight, fusion temperature, and fusion enthalpy of the SAs. Temperature and fusion temperature have the strongest influence on SA solubility in ILs. The performance of artificial neural networks (ANNs), least-squares support vector regression (LSSVR), and adaptive neuro-fuzzy inference systems (ANFIS) in predicting SA solubility in ILs was compared using a large databank (647 data points of 19 SAs and 21 ILs). Among the investigated models, ANFIS offered the best accuracy, with an average absolute relative deviation (AARD%) of 7.43% and a coefficient of determination (R2) of 0.98359. The best performance of the ANFIS model was obtained with a cluster center radius of 0.435 when trained with 85% of the databank. Further analysis of the ANFIS model by the leverage method revealed that the model is reliable owing to its high level of coverage and wide range of applicability. Accordingly, this model can be effectively utilized in modeling the solubilities of SAs in ILs.

temperatures but also benefit from high thermal stability and remarkable solubility strength. These characteristics make them potentially attractive tools to overcome various operational challenges 18 associated with conventional solvents. The versatility of ILs allows their features, thermochemical properties, and solvation power to be designed by adjusting the anion/cation pair appropriately [19][20][21][22][23].
ILs offer high dissolving capacity for SAs due to the presence of various cations and anions, relatively low melting points, and their ionic nature and non-volatility, which stem from strong anion-cation interactions [24][25][26][27][28][29]. Xia et al. have recently reported the fabrication of cellulose- and lignin-derived products employing ILs 30. Accordingly, ILs have remarkable benefits in SA extraction over conventional solvents. The solubilities of four sugar compounds (i.e., galactose, glucose, xylose, and fructose) in Aliquat®336 and 1-ethyl-3-methylimidazolium ethylsulfate ([Emim][EtSO4]) were measured (288-328 K) by Carneiro et al. and then correlated by two activity coefficient models (ACMs) 31. Carneiro et al. also carried out a theoretical and experimental study addressing the solubilities of sorbitol and xylitol in three ionic liquids over a wide range of temperatures (288-433 K) and evaluated some ACMs for thermodynamic modeling 32. In another study, they measured the solubilities of sorbitol and xylitol in five different ILs, namely 1-butyl-3-methylimidazolium dicyanamide ([Bmim][DCA]), 1-ethyl-3-methylimidazolium dicyanamide ([Emim][DCA]), 1-ethyl-3-methylimidazolium trifluoroacetate ([Emim][TFA]), trihexyltetradecylphosphonium dicyanamide ([P6,6,6,14][DCA]), and Aliquat® dicyanamide at 288-339 K 33. They also developed a thermodynamic model based on the perturbed-chain statistical associating fluid theory (PC-SAFT) equation of state (EoS). The solubilities of fructose and glucose in similar ILs were also measured by the same research group 34. Mohan et al. applied a molecular screening method based on the continuum solvation model to screen a large number of ILs for the solubility of xylose, glucose, fructose, and galactose over a fairly wide temperature range (303.15 K to 373.15 K) 35. They used the same approach to screen ILs for the solubility of sucrose, cellobiose, and maltose 36 37 .
Their thermodynamic analysis was based on the PC-SAFT EoS. The same group investigated the solid-liquid equilibria of dicyanamide-based ILs and SAs (erythritol, xylitol, and sorbitol) 38. A PC-SAFT modeling scheme was also employed to reproduce the measured data 38. They also reported the impact of functionalized cations on the properties of ILs and their solubility strength for glucose 39. The same PC-SAFT-based thermodynamic approach was also applied to this system. The solubilities of six monosaccharide SAs, namely glucose, mannose, fructose, galactose, xylose, and arabinose, in different ILs composed of varied cations (1-butyl-3-methylimidazolium and trihexyltetradecylphosphonium) and anions (dicyanamide, dimethylphosphate, and chloride) were determined experimentally (288.2-348.2 K), and their solvation characteristics, as well as molecular-scale mechanisms, were studied by Teles et al. 40. Asymmetric dicationic ILs have recently been introduced for this process, and a pioneering study was carried out by Yang et al., in which the impact of 1-(3-(trimethylammonio)prop-1-yl)-3-methylimidazolium bis(dicyanamide), 1-(3-(trimethylammonio)prop-1-yl)-1-methylpiperidinium bis(dicyanamide), and 1-(3-(trimethylammonio)prop-1-yl)pyridinium bis(dicyanamide) on the solubility of fructose and glucose was investigated at 323.15-353.15 K 41. ACMs (Wilson, non-random two-liquid (NRTL), and UNIQUAC) and semi-empirical equations (modified Apelblat and λh equations) were then applied to model the measured data. More recently, experimental investigations have addressed the solubility data of different compounds in numerous ILs [42][43][44]. Review studies also delve deeply into different aspects of this process 7,17,30.
These thermodynamic-based calculations (i.e., semi-empirical equations, ACMs, and EoSs) are only applicable to a specific SA-IL system and it is not possible to use them for monitoring the phase equilibrium of several systems simultaneously. On the other hand, the artificial intelligence (AI) approaches can be simply applied to estimate the solubility of a wide range of SAs in different ILs. Hence, any effort leading to the simulating of the solubility of SAs in ILs with the use of machine learning (ML) tools is currently of great interest. ML-based tools have already been engaged in the accurate, fast, and easy-to-use estimation of the equilibrium data 20,45-49 , process assessment 50,51 , the properties of solvents 22,[52][53][54][55][56] , oil reservoirs 57 , gas shales 58 , and biomass-derived materials 59 .
The solubility of SAs in ILs is a strong function of temperature and IL type, which is identified by the IL properties. The type and properties of the SA compound also affect the solvation behavior. Earlier thermodynamic models utilizing EoSs 33,38,39 and ACMs 31,32,34,35 have several disadvantages. These models can accurately calculate solubility data only over a narrow range of conditions. Moreover, they are component-specific, so each SA-IL system requires its own set of parameters. Consequently, a large number of component-specific parameters are found in the literature. For this reason, a universal ML model capable of covering a wide range of SA-IL systems and temperatures is of paramount interest. If the application of ML models proves successful, these models could replace the conventional computation methods because of their ease of use and short computation time. ML is now widely utilized to address engineering issues, such as thermophysical property estimation 60,61.
To benefit from the broad application of sugar alcohols in pharmaceuticals, the food industry, and chemical processes, it is necessary to extract them first. The feasibility study, design, and optimization of SA extraction by ionic liquids require precise knowledge of the solubility data. Since the experimental measurement of SA solubility in ILs is time-consuming and the literature introduces no comprehensive model for its estimation, the present study applies artificial intelligence tools to this task. The intelligent model constructed in this study can be effectively used in the simulation and optimization of SA extraction by ILs.
www.nature.com/scientificreports/
analyses that measure the monotonic association between independent-dependent pairs of variables 62. Pearson's, Spearman's, and Kendall's analyses rely on a statistical notion known as covariance, which signifies the degree of correlation or the strength of the relationship between two variables. In other words, these analyses offer a straightforward criterion for how a pair of variables vary together 63,64.
Pearson's analysis [Eq. (1)] yields a dimensionless parameter between −1 and +1. Spearman's criterion [Eq. (2)] spans the same range and is a modified version of Pearson's equation 62. Even though the range of Spearman's and Pearson's parameters is the same, their quantitative and qualitative prediction for a single independent-dependent pair may differ 62. For a system composed of a set of input (X) and output (Y) variables, the Pearson (r) and Spearman (r′) coefficients can be calculated as follows: where NDP and d indicate the number of data points and the difference between the two ranks of each observation, respectively. Kendall's criterion [Eq. (3)] uses a correlation coefficient (r′′) that is based on the ranks of the observations 64,65.
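The equations referred to in this paragraph did not survive in the present text. For orientation only, the standard textbook forms that agree with the definitions of NDP and d given above are sketched below. The authors' exact notation may differ, the Kendall expression is written in its simplest (tau-a) form as an assumption, and n_c and n_d (the numbers of concordant and discordant pairs) are symbols introduced here for illustration.

```latex
% Pearson coefficient, cf. Eq. (1)
r = \frac{\sum_{i=1}^{NDP} (X_i - \bar{X})(Y_i - \bar{Y})}
         {\sqrt{\sum_{i=1}^{NDP} (X_i - \bar{X})^2}\,
          \sqrt{\sum_{i=1}^{NDP} (Y_i - \bar{Y})^2}}

% Spearman coefficient, cf. Eq. (2); d_i is the rank difference of observation i
r' = 1 - \frac{6 \sum_{i=1}^{NDP} d_i^{2}}{NDP\,(NDP^{2} - 1)}

% Kendall coefficient, cf. Eq. (3); assumed tau-a form
r'' = \frac{n_c - n_d}{\tfrac{1}{2}\,NDP\,(NDP - 1)}
```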
By employing the correlation parameters (r, r′, and r′′), the relationship between the dependent variable (SA solubility in ILs) and the independent variables (temperature, the molecular weights (MW) of the SA and IL, the density of the IL, and the fusion temperature and enthalpy) can be determined based on the rules presented in Table 1.
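As a minimal sketch of how such relevancy factors can be computed for one independent-dependent pair, the pure-Python functions below evaluate the three coefficients. The temperature and solubility values are made-up illustrative numbers, not entries of the databank, and the Kendall function implements the simple tau-a variant as an assumption.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson coefficient: covariance divided by the product of spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values receive their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman coefficient: Pearson coefficient of the ranks."""
    return pearson(ranks(x), ranks(y))


def kendall(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    score = 0
    for i, j in combinations(range(len(x)), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        score += (product > 0) - (product < 0)
    n = len(x)
    return score / (n * (n - 1) / 2)


# Hypothetical data: solubility (mole fraction) rising with temperature (K).
temperature = [288.15, 298.15, 308.15, 318.15, 328.15]
solubility = [0.02, 0.035, 0.06, 0.10, 0.17]
print(pearson(temperature, solubility))
print(spearman(temperature, solubility))
print(kendall(temperature, solubility))
```

For a monotonic but nonlinear trend such as this one, the two rank-based coefficients reach their maximum while the Pearson coefficient stays below it, which illustrates why the three criteria can disagree for the same pair of variables.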
The main idea of this study lies in the use of ML models with the simplest procedure. In addition, the employed parameters represent the nature of the materials in question and distinguish them: temperature is the main process variable in SA-IL solid-liquid equilibria; the MWs distinguish the compounds and are somewhat representative of the molecular length; the fusion enthalpy and temperature characterize the solubility of solids in liquids; and the density of the IL sheds light on the solvation power of the solvent. These variables are readily available for the entire databank. Other variables mentioned by the reviewer are not available for all of the compounds, and the vaporization temperature is not meaningful in this system because the evaporation of ILs is negligible. For these reasons, these variables were selected for modeling the systems in question.
These analyses and further analysis of the ML models are implemented based on a solubility databank (647 data points of 19 SAs and 21 ILs). The databank is reported in Table 2. The properties of ILs and SAs are also summarized in Tables 3 and 4, respectively.
Machine learning. Artificial neural networks (ANNs) have different variants, including the multilayer perceptron (MLP), radial basis function (RBF) network, recurrent neural network (RNN), general regression neural network (GRNN), and cascade feed-forward neural network (CFFNN) 87,88. The smallest meaningful unit of an ANN is the artificial neuron, which performs calculations based on Eq. (4) 89. In Eq. (4), z stands for the neuron's output; a particular artificial neuron receives n inputs (x_i), and each connection is adjusted by a corresponding weight (w_i). Moreover, each neuron contains one extra adjustable parameter, which is called the bias (b). To overcome the restriction of purely linear input-output mappings and to model nonlinear relationships, an activation function (ϕ), which is frequently nonlinear, is also incorporated in the neuron body. Linear, hyperbolic tangent, logistic, and Gaussian activation functions can be implemented in the neuron structure 90.
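The neuron described above is conventionally written as z = ϕ(Σ w_i x_i + b). The following minimal sketch implements that conventional form with the symbols of the text; the function names and numerical values are illustrative assumptions, not taken from the study.

```python
import math


def logistic(s):
    """Logistic (sigmoid) activation function."""
    return 1.0 / (1.0 + math.exp(-s))


def neuron(inputs, weights, bias, activation=math.tanh):
    """Output z of one artificial neuron: activation of the weighted sum plus bias."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(weighted_sum)


# Hypothetical neuron with three inputs.
x = [0.5, -1.0, 2.0]
w = [0.2, 0.4, -0.1]
b = 0.05
print(neuron(x, w, b))                       # hyperbolic tangent activation
print(neuron(x, w, b, activation=logistic))  # logistic activation
```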
All of the ANN models include three types of layers: an input layer that receives the independent variables, the output layer that delivers the target prediction, and single or multiple hidden layers which have the task of data processing and recognition 90 . The number of independent features and dependent variables determines the number of elements in the input and output layers, respectively.
The training phase of an ANN is responsible for obtaining appropriate values of the biases and weights that provide the best prediction accuracy for a dependent variable. This study applied the following ANN models to find the best one for calculating SA solubilities in a variety of ILs.
64 ] in the hidden layer, whereas its output is a linear combination of neuron parameters and the RBF transformation of the inputs [Eq. (8)]. One of the best features of this model is its simplicity and fast training 47. where σ is the spread factor. To obtain the best RBF performance, the number of nodes in the hidden layer and the spread coefficient must be determined carefully 47.
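The forward pass of an RBF network can be sketched as follows, assuming the widely used Gaussian basis exp(−‖x − c‖² / (2σ²)). The exact form of the missing equations is not recoverable from this text, so the factor 2σ², the names `centers`, `weights`, and `bias`, and all numbers are assumptions made for illustration.

```python
import math


def rbf_predict(x, centers, weights, bias, sigma):
    """Linear combination of Gaussian radial basis functions of the input x."""
    output = bias
    for center, weight in zip(centers, weights):
        squared_distance = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
        output += weight * math.exp(-squared_distance / (2.0 * sigma ** 2))
    return output


# Hypothetical network: two hidden nodes acting on a two-dimensional input.
centers = [[0.0, 0.0], [1.0, 1.0]]
weights = [0.8, -0.3]
print(rbf_predict([0.5, 0.5], centers, weights, bias=0.1, sigma=0.7))
```

A small spread makes each hidden node respond only near its center, while a large spread smooths the response, which is why the number of nodes and the spread need to be tuned together.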
Cascade feed-forward neural network (CFFNN). This model generates a cascade configuration that links the nodes of the input layer to both the hidden and output layers 46. This model also utilizes the hyperbolic tangent and logistic sigmoid transfer functions in the hidden and output layers, respectively.
It is worth noting that the learning step alters the connection weights and biases by a predefined optimization algorithm. This optimization algorithm continuously changes the model's weights and biases to minimize the prediction error between the model output and the expected target (real data). This study employs the Levenberg-Marquardt (LM) algorithm to accomplish the training phase of the CFFNN and MLP models.
General regression neural network (GRNN). Similar to the MLP and CFFNN, this ANN type also consists of input, hidden, and output layers. The last two layers have Gaussian and linear transfer functions, respectively. The only difference between the GRNN and RBF topologies is that the number of hidden neurons of the former is fixed and cannot be manipulated 91.
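In the usual formulation of the GRNN, one hidden neuron is placed on every training sample, which is why the hidden-layer size is fixed, and the prediction is a Gaussian-weighted average of the training targets. The sketch below follows that usual formulation under the assumption of a single spread `sigma`; the training values are invented for illustration.

```python
import math


def grnn_predict(x, train_x, train_y, sigma):
    """Gaussian-kernel weighted average of the training targets."""
    numerator = 0.0
    denominator = 0.0
    for sample, target in zip(train_x, train_y):
        squared_distance = sum((a - b) ** 2 for a, b in zip(x, sample))
        kernel = math.exp(-squared_distance / (2.0 * sigma ** 2))
        numerator += kernel * target
        denominator += kernel
    return numerator / denominator


# Hypothetical one-dimensional training set.
train_x = [[0.0], [2.0]]
train_y = [1.0, 3.0]
print(grnn_predict([1.0], train_x, train_y, sigma=1.0))  # midway between the samples
```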
Adaptive neuro-fuzzy inference systems (ANFIS). ANFIS is designed by combining fuzzy logic and ANN to benefit from the strength of both models. This model consists of five successive layers, namely the first layer (fuzzy formation), the second layer (fuzzy rules), the third layer (normalization of membership functions), the fourth layer (fuzzy rule conclusion section), and the fifth layer (output calculation). By minimizing the observed error between the predicted and actual responses utilizing an appropriate scenario, the parameters of ANFIS can be adjusted 92 .
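The five layers can be illustrated with a forward pass through a first-order Sugeno-type fuzzy system, which is the structure most commonly associated with ANFIS. This is a sketch under assumptions: Gaussian membership functions, product rule firing, and the dictionary keys `centers`, `widths`, `coeffs`, and `const` are all choices made here for illustration, and the training step that adjusts these parameters is not shown.

```python
import math


def gaussian_mf(x, center, width):
    """Layer 1: Gaussian membership grade of one input."""
    return math.exp(-0.5 * ((x - center) / width) ** 2)


def anfis_forward(x, rules):
    """Forward pass of a first-order Sugeno fuzzy system with product rules."""
    # Layers 1 and 2: membership grades multiplied into rule firing strengths.
    firing = []
    for rule in rules:
        strength = 1.0
        for xi, c, s in zip(x, rule["centers"], rule["widths"]):
            strength *= gaussian_mf(xi, c, s)
        firing.append(strength)
    # Layer 3: normalization of the firing strengths.
    total = sum(firing)
    normalized = [strength / total for strength in firing]
    # Layer 4: each rule's linear consequent, weighted by its normalized strength.
    consequents = [
        wn * (sum(p * xi for p, xi in zip(rule["coeffs"], x)) + rule["const"])
        for wn, rule in zip(normalized, rules)
    ]
    # Layer 5: overall output as the sum of the rule contributions.
    return sum(consequents)


# Hypothetical two-rule system with a single input.
rules = [
    {"centers": [-1.0], "widths": [1.0], "coeffs": [0.0], "const": 0.0},
    {"centers": [1.0], "widths": [1.0], "coeffs": [0.0], "const": 2.0},
]
print(anfis_forward([0.0], rules))
```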
Least-squares support vector regression (LSSVR). This method transfers the independent variables to a multi-dimensional space through the application of kernel functions (K) 92. The most widely employed kernel types in the LSSVR model are the polynomial, linear, and Gaussian kernels.
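A minimal sketch of LSSVR with a Gaussian kernel is given below. It uses the standard least-squares SVM formulation, in which training reduces to solving one linear system for the bias and the coefficients; the regularization parameter `gamma`, the kernel width `sigma`, the small dense solver, and the data are assumptions made for illustration and are not the settings of this study.

```python
import math


def gaussian_kernel(a, b, sigma):
    """Gaussian kernel K(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def solve(matrix, rhs):
    """Solve a small dense linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (aug[i][n] - tail) / aug[i][i]
    return solution


def lssvr_fit(X, y, gamma, sigma):
    """Solve [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y]."""
    n = len(X)
    system = [[0.0] + [1.0] * n]
    for i in range(n):
        row = [1.0]
        for j in range(n):
            value = gaussian_kernel(X[i], X[j], sigma)
            if i == j:
                value += 1.0 / gamma
            row.append(value)
        system.append(row)
    solution = solve(system, [0.0] + list(y))
    return solution[0], solution[1:]  # bias, alphas


def lssvr_predict(x, X, bias, alphas, sigma):
    """Prediction: bias plus the kernel expansion over the training samples."""
    return bias + sum(a * gaussian_kernel(x, xi, sigma) for a, xi in zip(alphas, X))


# Hypothetical one-dimensional training data.
X = [[0.0], [1.0], [2.0]]
y = [0.0, 1.0, 4.0]
bias, alphas = lssvr_fit(X, y, gamma=1e6, sigma=1.0)
print(lssvr_predict([1.5], X, bias, alphas, sigma=1.0))
```

A large `gamma` makes the model follow the training data closely, while a smaller value trades training accuracy for smoothness.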

Results and discussion
This section includes the results of relevancy analysis, ranking analysis, and a detailed investigation of determining the best model for predicting SAs solubility in ILs.
The correlation coefficients (relevancy factors) between the dependent and independent variables were calculated by the three methods and are presented in Fig. 1. To this end, six parameters, namely temperature, the MWs and densities of the solvents (ILs), the MW of the solute (SA), and the fusion temperature and enthalpy of the SA, were assessed, among which temperature and fusion temperature have the largest impact on the solubility. The MW of the SAs and the enthalpy of fusion also have a large impact on SA solubility in ILs. The observed relevancy factors show that while the temperature and the properties of the ILs (MW and density) enhance the solubility of SAs, the features of the SAs (MW, fusion temperature, and enthalpy) have the opposite effect. The next analysis presents a ranking test, which compares the investigated models (Fig. 2). It is worth noting that this comparison is made at the best structure of each model. The features of the models' pre-assessment, as well as the best performance of each model, are given in Tables 5 and 6.
Based on their performance in the training, testing, and combined phases, the models are sorted and compared in this figure. The training phase included 85% of the entire databank. The rank was calculated based on Eq. (15) and the rank indices already calculated by Eqs. (9)-(14). The results of the ranking test show that the ANFIS model offers the best estimations of the solubility data for the training, testing, and entire datasets, while RBFNN presents the lowest accuracy for the same data distribution. Consequently, the ANFIS model is introduced as the best model and applied to simulate different SA-IL systems in the following sections. This model predicts the overall experimental data with AARD = 7.43%, MAE = 0.017, RAE = 9.28%, MSE = 0.0009, RMSE = 0.03, and R2 = 0.98260. Figure 3 depicts the solubilities calculated by the ANFIS model (Fig. 3A-C), as well as the relative deviations (Fig. 3D), against the experimental values. These figures confirm that the calculated solubilities in both the training and testing steps are close to the real ones, which indicates the effectiveness of the model. The distribution of the relative deviations also confirms this statement.
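The accuracy indices quoted above can be computed from paired experimental and calculated solubilities as sketched below. Because Eqs. (9)-(14) are not reproduced in this text, the function uses the common textbook definitions as an assumption (for instance, R2 is taken as one minus the ratio of the residual to the total sum of squares), and the sample values are invented.

```python
from math import sqrt


def regression_metrics(experimental, calculated):
    """Common accuracy indices for paired experimental and calculated values."""
    n = len(experimental)
    mean_exp = sum(experimental) / n
    abs_err = [abs(e - c) for e, c in zip(experimental, calculated)]
    sq_err = [(e - c) ** 2 for e, c in zip(experimental, calculated)]
    mse = sum(sq_err) / n
    return {
        # Average absolute relative deviation, in percent.
        "AARD%": 100.0 / n * sum(a / abs(e) for a, e in zip(abs_err, experimental)),
        # Mean absolute error.
        "MAE": sum(abs_err) / n,
        # Relative absolute error, in percent.
        "RAE%": 100.0 * sum(abs_err) / sum(abs(e - mean_exp) for e in experimental),
        # Mean squared error and its root.
        "MSE": mse,
        "RMSE": sqrt(mse),
        # Coefficient of determination.
        "R2": 1.0 - sum(sq_err) / sum((e - mean_exp) ** 2 for e in experimental),
    }


# Hypothetical mole-fraction solubilities.
x_exp = [0.10, 0.20, 0.40]
x_calc = [0.11, 0.18, 0.41]
for name, value in regression_metrics(x_exp, x_calc).items():
    print(name, round(value, 5))
```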
This figure also shows that overfitting has not occurred in this system. Overfitting occurs when a calculation procedure learns every detail of a system in the training step and then estimates the testing data inaccurately. An indication of overfitting is a small error on the training dataset together with large errors on the testing dataset. As a consequence of overfitting, the model cannot generalize the features or patterns that it learned in the training phase. A common cause of overfitting is an unsuitable split between the training and testing datasets, namely a training dataset that is too small, which was not the case in this study. The data distribution employed in this study (85% for the training set) and the accuracies on the training and testing datasets signify that the ANFIS model did not overfit. This can be verified by tracking the residuals and standard deviations in the training and testing categories.
The panels of Fig. 4, which depict the distributions of the residuals (X_i^exp. − X_i^calc.), indicate that the residuals of the major portion of the training, testing, and entire datasets fall within the range of ±0.05 (molar fraction). The average residual values and standard deviations were calculated based on Eqs. (16) and (17) 64, respectively. The ANFIS model presents average residuals of 0.0020004, 0.0019553, and 0.0019937 for the testing, training, and entire datasets, while the standard deviations are 0.029708, 0.031421, and 0.029946, respectively.
Fig. 5. In Eq. (19), M is an NDP × 6 matrix containing the experimental values of the independent variables. The leverage method then identifies the region in which a model is applicable by using the standardized residuals, which should lie in the range of ±3. In Fig. 5, this range is identified by dotted lines. Equation (20) is utilized to determine the critical leverage. The applicability domain of the ANFIS model and the corresponding boundaries are defined in Fig. 5. In a nutshell, the leverage method confirms that the ANFIS model is capable of estimating the solubility of SAs in ILs based on the collected databank with high reliability. Since only 20 data points among the 647 solubility samples were identified as either good leverage (hat index > critical leverage) or outliers (standardized residuals outside the range of ±3), the domain of applicability covers more than 96.9% of the entire databank. On account of these findings, the ANFIS model is reliable owing to its high level of coverage and wide range of applicability.
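The equations of the leverage method are not reproduced in this text. As a hedged reconstruction, the forms commonly used in applicability-domain (Williams plot) analyses are given below, with M the NDP × 6 matrix defined above and p = 6 independent variables; the symbols h_i and H* and the factor 3(p + 1) are the conventional choices and are assumed here rather than taken from the study. The last line simply checks the coverage figure quoted in the text.

```latex
% Hat matrix and leverage (hat index) of data point i, cf. Eq. (19)
H = M \left( M^{\mathsf{T}} M \right)^{-1} M^{\mathsf{T}}, \qquad h_i = H_{ii}

% Critical leverage for p independent variables, cf. Eq. (20) (assumed conventional form)
H^{*} = \frac{3\,(p + 1)}{NDP} = \frac{3\,(6 + 1)}{647} \approx 0.0325

% Coverage of the applicability domain reported in the text
\frac{647 - 20}{647} = \frac{627}{647} \approx 0.969 \quad (96.9\%)
```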
The solubility of sorbitol in diverse ILs is presented in Fig. 6 and compared with the ANFIS calculations. Clearly, the model can represent the solubility data over the entire range of temperatures and can also distinguish the effect of solvent type. Figure 7, which addresses the impact of the IL on xylitol solubility, also shows that the ANFIS results are sufficiently accurate across the low-to-high solubility ranges. Similar observations for the solubility of fructose in different ILs are shown in Fig. 8. This figure indicates that even sharp changes of solubility with temperature can be simulated by the ANFIS model with remarkable precision.
The solubility behavior of fructose, glucose, and sucrose in the [C4C1Im][CF3CO2] IL is presented in Fig. 9, which shows that it is well estimated by the ANFIS model over the entire range of temperatures. The [C4C1Im][(OCH3)2PO4] IL, which offers low-to-high solubility capacity for different SA compounds, including xylose, glucose, and fructose, is assessed and compared with the ANFIS estimations in Fig. 10. It can be seen that the model describes the solubility behavior of different IL-SA pairs, from the low to the high solubility range, with remarkable accuracy.
The impact of the SA compound and temperature on the absorption capacity of the [bmim][DCA] IL is compared in Fig. 11. Despite some minor discrepancies in the higher solubility range, the ANFIS model represents the real data accurately. It is worth noting that the collected dataset and the discrepancies between different data sources also affect the accuracy of the ANFIS model, and a major portion of the observed error arises from this scatter. This issue is clearly visible in Fig. 11B. As the figure shows, there is more than one reference for the solubilities of sorbitol, glucose, and fructose in [bmim][DCA], and the reported values and even the trends in each case vary to a great extent, which generates uncertainties in the model behavior. Table 7 summarizes the performance of the ANFIS model for predicting the phase equilibrium of various SA-IL pairs. The largest deviation is observed in the case of D-xylitol solubility in [bmim][C(CN)3] (ARD = 9.72% and AARD = 29.62%), and the largest relative deviations (Min RD = −28.02% and Max RD = 75.23%) belong to a data point from the same system as well. Nevertheless, the AARD% does not exceed 10% for the majority of SA-IL pairs, which signifies the accuracy of the ANFIS model in representing the solubility data.
Although ML studies that consider the systems in question cannot currently be found in the literature, a comparison between the available modeling approaches and the one developed herein is of great interest and can shed light on the quality of ML models. The previous calculation procedures for the solubility of SAs in ILs rely on thermodynamic modeling with ACMs, mainly NRTL and UNIQUAC, and EoSs such as PC-SAFT.
PC-SAFT is a popular EoS that can be used in either a predictive or a correlative scheme. Carneiro et al. 33 developed calculations based on this method for the solubilities of xylitol and sorbitol in 1-ethyl-3-methylimidazolium dicyanamide, 1-butyl-3-methylimidazolium dicyanamide, Aliquat® dicyanamide, trihexyltetradecylphosphonium dicyanamide, and 1-ethyl-3-methylimidazolium trifluoroacetate at 288-339 K. Across all systems investigated, they obtained deviations of 3.7-112.2% and 3.3-21.7% when the predictive and correlative approaches were employed, respectively. The use of a fitting parameter in the calculations, which was determined by regression of the solubility data, notably improved the accuracy of the calculations. Paduszynski et al. 37 also used the same approach with some minor modifications for the systems including the 1-butyl-3-methylimidazolium dicyanamide and 1-butyl-3-methylimidazolium trifluoroacetate ILs and the glucose, fructose, and sucrose SAs. They reported very poor agreement between the calculations and the measurements in the predictive mode. By using two adjustable parameters and the regression of solubility temperatures, they improved the accuracy of the model considerably. Their further study also reported the same trends 38. Although this method of calculation was successful and is in many cases comparable to the ML calculations in this study, it demands a two-step regression procedure including the optimization of pure and binary data. The solubilities of glucose, fructose, xylose, and galactose in two ILs, namely 1-ethyl-3-methylimidazolium ethylsulfate (also known as [emim][EtSO4]) and Aliquat®336, at 288-328 K were modeled by Carneiro and co-workers 31 using the NRTL and UNIQUAC equations.
The AARD% obtained for the two equations did not exceed 4% (based on molar fractions) in almost all cases, and no significant difference between the two equations was observed. This team 32 then applied a similar methodology with the same ACMs and an e-NRTL equation to systems composed of xylitol or sorbitol and several ILs. Their calculations based on the NRTL, UNIQUAC, and e-NRTL ACMs resulted in deviations of 0.9-3.7%, 1.1-3.2%, and 0.7-2.6%, respectively. Similar observations were reported in further studies by the same research group 34. Compared to the ML models, the thermodynamic models based on activity coefficient equations demand more sophisticated calculations as well as a regression-based procedure, which can lead to the accumulation of a large number of parameters for a vast number of SA-IL binary systems.

Conclusions
Ionic liquids have recently been introduced to enhance the development of sugar-derived compounds and their efficient extraction. This study is the first attempt to develop several machine learning models for predicting the solubility of sugar alcohols in ionic liquids. The machine learning models were implemented using 647 solubility samples of 19 sugar alcohols in 21 ionic liquids collected from the literature. After the effective variables were identified, i.e., the temperature, the molecular weight and density of the ILs, and the molecular weight, fusion temperature, and fusion enthalpy of the SAs, artificial neural networks, least-squares support vector regression, and an adaptive neuro-fuzzy inference system (ANFIS) were appraised, among which ANFIS was the superior one. The accuracy of this model was confirmed by an R2 of 0.98359 and an AARD of 7.43% for estimating the entire databank. By contrast, the radial basis neural network was identified as the worst model, with AARD = 18.21% and R2 = 0.93202. Checking the ANFIS model predictions by the leverage method showed that this model is reliable because of its broad range of applicability and remarkable level of coverage. The results of this investigation can contribute to the screening of ionic liquid solvents for the appropriate extraction of sugar alcohols. Moreover, ANFIS models can be efficiently employed for solubility estimation in the investigated SA-IL systems.